Do you remember yours?
ZX Spectrum
8-bit CPU



A CPU most fondly remembered 
Golden Oldies


[Z80 assembly listing: Float24Mul (;BHL*CDE -> BHLA is the result; B has the right sign) and its helper BC_Times_DE (;BHLA is the 32-bit result; ;227+10b-7p)]
Brave New World

mulss: 1 float
mulps: 4 floats
vmulps: 8 floats

I've got the Power!


virtual void Foo();

...

pObject->Foo();

• Fetch object pointer
• Fetch vtable pointer
• Fetch vtable entry





Main Memory Is Not Slow 
8-36 GB/s

a->b->c->d->...

p[0]
p[1]
p[2]
p[3]
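The contrast between the dependent chain a->b->c->d and the independent loads p[0], p[1], ... can be sketched as follows. This is a minimal illustration rather than code from the talk; `Node`, `SumList` and `SumArray` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node type: each element stores the address of the next one.
struct Node
{
    int   m_nData;
    Node *m_pNext;
};

// Dependent loads: the address of the next node is known only after the
// current node has arrived from memory, so cache misses cannot overlap.
int SumList( const Node *pHead )
{
    int nSum = 0;
    for ( const Node *p = pHead; p != nullptr; p = p->m_pNext )
        nSum += p->m_nData;
    return nSum;
}

// Independent loads: every address can be computed up front, so several
// loads can be in flight at the same time.
int SumArray( const int *p, std::size_t nCount )
{
    assert( p != nullptr || nCount == 0 );
    int nSum = 0;
    for ( std::size_t i = 0; i < nCount; ++i )
        nSum += p[i];
    return nSum;
}
```

Both functions do the same amount of arithmetic; the difference is only in how early the next address is available.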






Cache lines are indivisible

Pack & Align


One Cycle per Byte

RAM: 8 GB/s
4 x ~2 GB/s
2 GHz
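The arithmetic behind "one cycle per byte" can be written out as below. The even split of the 8 GB/s across four cores is an assumption read off the four ~2 GB/s labels; the constant and function names are hypothetical.

```cpp
#include <cassert>

// Figures from the slide; the four-way split is an assumption.
constexpr double kRamBytesPerSecond   = 8e9; // 8 GB/s in total
constexpr double kCoreCyclesPerSecond = 2e9; // 2 GHz core clock
constexpr int    kCores               = 4;

// Bytes of RAM traffic each core can count on per clock cycle.
constexpr double BytesPerCyclePerCore()
{
    return ( kRamBytesPerSecond / kCores ) / kCoreCyclesPerSecond;
}

// The same budget seen from the other side: cycles available per byte.
double CyclesPerByte( double flBytesPerCycle )
{
    assert( flBytesPerCycle > 0.0 );
    return 1.0 / flBytesPerCycle;
}
```

With these figures each core gets about 2 GB/s at 2 GHz, which works out to one byte per cycle.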
Measure



Estimate 

Verify 



Repeat 
Compute-Bound


Predictable 
We are trained for this 

Hard to achieve 
Cache Misses

...they're bad
Cache Miss: Ticks

[Bar chart: cache-miss cost in ticks, last-gen PowerPC vs. modern x86/AMD64]
3+ IPC and beyond

mov %r11, %r10
mov %rsi, %rcx
mov %r13, %r12
mov %r15, %r14

inc %r11
inc %r10
inc %rsi
Cache miss: RealCost (tm)

[Bar charts: last-gen PowerPC and modern x86/AMD64, in ticks and in instructions delayed*]

* Not-So-Rigorous Analysis (tm)
OOO FTW!

OOO Limits

OOO Limits IRL





Latency and Throughput 


[Diagram: four mulps instructions in flight]

[Example: solve for x, y, z]
Think Short Chains


More opportunities to reorder 
Many cache misses resolve at once 
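One way to read "think short chains" is to split one long dependency chain into several independent ones. The sketch below sums an array with one accumulator and with four; the function names and the multiple-of-four restriction are assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>

// One long chain: every add has to wait for the previous add to finish.
float SumOneChain( const float *p, std::size_t nCount )
{
    float flSum = 0.0f;
    for ( std::size_t i = 0; i < nCount; ++i )
        flSum += p[i];
    return flSum;
}

// Four short chains: the four accumulators do not depend on each other,
// so an out-of-order core can overlap their adds and their loads.
float SumFourChains( const float *p, std::size_t nCount )
{
    assert( nCount % 4 == 0 ); // simplification for this sketch
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for ( std::size_t i = 0; i < nCount; i += 4 )
    {
        s0 += p[i + 0];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    return ( s0 + s1 ) + ( s2 + s3 );
}
```

Floating-point addition is not associative, so the two versions can differ in the last bits on general input; whether that is acceptable depends on the use case.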
Real CPU Latency

[Chart: Latency vs. Real Latency*]

* The Real Latency: a technical term (tm)
Prefetch

Linear Streams
Cache Miss Overlap
Fix All or Nothing



Maxing RAM out 






RAM 


Load Balance 
Prefetch Techniques


Use Linear Streams 
Rely on Reordering 
Manual prefetch 
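Manual prefetch is the least obvious of the three techniques, so here is a hedged sketch of it for an indexed (non-linear) access pattern. `__builtin_prefetch` is a GCC/Clang builtin; on other compilers the wrapper below does nothing. The lookahead distance of 8 and the names `PrefetchRead` and `SumIndexed` are assumptions, and a real distance should come from measurement.

```cpp
#include <cassert>
#include <cstddef>

// Thin wrapper so the sketch still compiles where the builtin is missing.
inline void PrefetchRead( const void *p )
{
#if defined( __GNUC__ ) || defined( __clang__ )
    __builtin_prefetch( p );
#else
    ( void )p;
#endif
}

// Sums pData[ pIndex[i] ]. The index table is a linear stream, but the
// data accesses are scattered, so the hardware prefetcher cannot guess them.
int SumIndexed( const int *pData, const std::size_t *pIndex, std::size_t nCount )
{
    assert( nCount == 0 || ( pData != nullptr && pIndex != nullptr ) );
    const std::size_t kAhead = 8; // assumed lookahead; tune by measuring
    int nSum = 0;
    for ( std::size_t i = 0; i < nCount; ++i )
    {
        // Request the element needed kAhead iterations from now.
        if ( i + kAhead < nCount )
            PrefetchRead( &pData[ pIndex[ i + kAhead ] ] );
        nSum += pData[ pIndex[ i ] ];
    }
    return nSum;
}
```

The prefetch only hides latency if there is enough other work between the request and the use, which is why the distance matters.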
37 GB/s? Really?

[Chart: bandwidth for 1 Stream, 2 Streams, 2x2 Streams, Many Streams]

8 GB/s

2 GHz
Cache Misses

...beware of 'em
Bandwidth: take 2

1. Load Data
2. ...Other Stuff (150+ cycles)
3. Use Data
Bandwidth: Hi-End

Load Hit Store

int32

Don't Mix

Stores affect cache
(pollution, migration)
Branches are inexpensive
Branch (mis)prediction

Predictability=performance
1-4 deep history
random = 50% mispredicted
misprediction is ~10-20 ticks
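When a branch depends on effectively random data, one common response is to remove the branch. The sketch below counts elements above a threshold both ways; the function names are hypothetical, and a compiler may well generate branch-free code for both versions on its own.

```cpp
#include <cassert>
#include <cstddef>

// Branchy version: on random data the "if" is taken about half the time
// with no pattern, which is the worst case for the predictor.
std::size_t CountAboveBranchy( const int *p, std::size_t nCount, int nThreshold )
{
    assert( p != nullptr || nCount == 0 );
    std::size_t n = 0;
    for ( std::size_t i = 0; i < nCount; ++i )
    {
        if ( p[i] > nThreshold )
            ++n;
    }
    return n;
}

// Branch-free version: the comparison result (0 or 1) is added directly,
// so there is nothing data-dependent left to predict.
std::size_t CountAboveBranchless( const int *p, std::size_t nCount, int nThreshold )
{
    assert( p != nullptr || nCount == 0 );
    std::size_t n = 0;
    for ( std::size_t i = 0; i < nCount; ++i )
        n += static_cast<std::size_t>( p[i] > nThreshold );
    return n;
}
```

The loop branch itself stays, but it is taken every time except the last and therefore predicts almost perfectly.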
Branch Prediction

predicted branch ~0-1 tick
100% consistent = ~100% predicted
Careful: Recursion


Return Stack Buffer 
Typically 8-24 deep 
Case study: cloth

• 16 verts
• 1.6 Instructions per Cycle
• 600 cycles
• 5 cache lines streamed
• 8 cache lines random access
• ~1.4 bytes per cycle
• ALU bound, RAM: 4.7 GB/s
Cycles / Iteration

nStart = rdtsc();
for ( int i = 0; i < N; ++i )
    OneIteration();
Cycles += rdtsc() - nStart;
Iterations += N;
Derived Metrics


Instructions per cycle 
Cycles per byte 
Instructions per byte 
Bytes per second 
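The four derived metrics follow directly from cycles, instructions and bytes per iteration. The sketch below applies them to the cloth case study figures (600 cycles, 1.6 instructions per cycle, 5 + 8 cache lines). Two inputs are assumptions not stated on the slides: a 64-byte cache line, and a 3.4 GHz clock chosen only because it reproduces the quoted ~4.7 GB/s. All type and function names are hypothetical.

```cpp
#include <cassert>

// Raw per-iteration measurements.
struct Measurement
{
    double flCycles;       // e.g. from an rdtsc loop
    double flInstructions; // e.g. from a profiler
    double flBytes;        // bytes of memory touched
};

struct DerivedMetrics
{
    double flInstructionsPerCycle;
    double flCyclesPerByte;
    double flInstructionsPerByte;
    double flBytesPerSecond;
};

DerivedMetrics Derive( const Measurement &m, double flClockHz )
{
    assert( m.flCycles > 0.0 && m.flBytes > 0.0 && flClockHz > 0.0 );
    DerivedMetrics d;
    d.flInstructionsPerCycle = m.flInstructions / m.flCycles;
    d.flCyclesPerByte        = m.flCycles / m.flBytes;
    d.flInstructionsPerByte  = m.flInstructions / m.flBytes;
    d.flBytesPerSecond       = m.flBytes / m.flCycles * flClockHz;
    return d;
}

// Cloth case study: 13 cache lines of an assumed 64 bytes each,
// and 1.6 instructions per cycle over 600 cycles.
Measurement ClothExample()
{
    Measurement m;
    m.flCycles       = 600.0;
    m.flInstructions = 1.6 * 600.0;
    m.flBytes        = ( 5 + 8 ) * 64.0;
    return m;
}
```

Under these assumptions the example gives roughly 1.4 bytes per cycle, matching the case study.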
Reorder Buffer

Summary

• Cache
• OOO
• RAM
• 1 byte per op
Small Pointers, Fixed Addresses: Rulez!

                   8-bit   64-bit
Small Pointers     Yes     No
Fixed Addresses    Yes     No
Relative Pointer

                 Same in 32- and 64-bit   Moveable
RPointer<Foo>    Yes                      Yes
Foo *            No                       No
Relative Pointer

16-bit -> +/-32KB
32-bit -> +/-2GB
RPointer<T>

template <typename T>
class RPointer
{
    int32 m_nOffset; // byte offset from this field to the target; 0 means NULL
public:
    T* operator -> ()
    {
        Assert( m_nOffset != 0 ); // must not be NULL
        return ( T* )( ( ( byte* )&m_nOffset ) + m_nOffset );
    }

    void operator = ( T *p ) { m_nOffset = p ? int32( ( byte* )p - ( byte* )this ) : 0; }
};
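To show why a relative pointer makes data moveable, here is a self-contained sketch in standard C++. It restates the class with `std::int32_t` and `assert` in place of the slide's `int32` and `Assert`, and adds a hypothetical `Get()` helper; `Elem` and `SecondValueAfterMove` are illustrative names as well. It builds a two-element list in one buffer, copies the bytes to another buffer, and follows the link in the copy.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <new>

template <typename T>
class RPointer
{
    std::int32_t m_nOffset = 0; // byte offset from this field to the target; 0 means NULL
public:
    T *Get() const
    {
        if ( m_nOffset == 0 )
            return nullptr;
        const char *pBase = reinterpret_cast<const char *>( &m_nOffset );
        return reinterpret_cast<T *>( const_cast<char *>( pBase + m_nOffset ) );
    }

    T *operator->() const
    {
        T *p = Get();
        assert( p != nullptr ); // must not be NULL
        return p;
    }

    // Stores the distance to p, not its address.
    void operator=( T *p )
    {
        const char *pBase = reinterpret_cast<const char *>( &m_nOffset );
        m_nOffset = p ? static_cast<std::int32_t>( reinterpret_cast<const char *>( p ) - pBase ) : 0;
    }
};

struct Elem
{
    RPointer<Elem> m_pNext;
    int m_nData;
};

// Builds a list in one buffer, moves the bytes, and reads through the copy.
int SecondValueAfterMove()
{
    alignas( Elem ) unsigned char original[ 2 * sizeof( Elem ) ];
    alignas( Elem ) unsigned char moved[ 2 * sizeof( Elem ) ];

    Elem *pFirst  = new ( original ) Elem;
    Elem *pSecond = new ( original + sizeof( Elem ) ) Elem;
    pFirst->m_pNext  = pSecond;
    pFirst->m_nData  = 1;
    pSecond->m_pNext = nullptr;
    pSecond->m_nData = 123;

    std::memcpy( moved, original, sizeof( original ) );
    std::memset( original, 0, sizeof( original ) ); // the copy must not rely on the old block

    Elem *pMovedFirst = reinterpret_cast<Elem *>( moved );
    return pMovedFirst->m_pNext->m_nData;
}
```

The link survives the copy because both elements moved together, so the distance between them did not change. Copying a single `RPointer` by value to a different address would not preserve its target; only whole blocks should be moved.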
Cache Misses..

...did I mention 'em?
Filling In




Commit 
Pointer vs Offset

struct Elem
{
    Elem *m_pNext;
    int m_nData;
};

...

Elem *p = new( pAllocator ) Elem;
p->m_pNext = NULL;
p->m_nData = 123;


struct Elem
{
    RPointer<Elem> m_pNext;
    int m_nData;
};

...

Elem *p = new( pAllocator ) Elem;
p->m_pNext = NULL;
p->m_nData = 123;
Strengths


• Very simple 

• No runtime overhead 

• Data is relocatable 

• Data is more compact 
Limitations


• No Automatic Versioning 

• Continuous Data Block 

• More Stuff to Learn 
Summary

• Cache
• OOO
• RAM
• 1 byte per op
• Resource Pointers
Big
Trie
How useful is O(*)?

O(0)

Best optimization ever

Do nothing

O(1)

• Cache it
• Do it once per frame
• Approximate it
• Be creative...
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Array 


D(200 


* 
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Array 


>00 


* 
O(*) nuances

                             Array    Linked List
Traversal                    O(N)     O(N)
Cache Misses in Traversal    O(0)     O(N)
Map vs Hashtable

              Classic RB-Tree   Classic Hash Table
Worst-Case    O(log N)          O(N)
Amortized     O(log N)          O(1)
Priorities

• Algorithm vs Cache
• C1 * N vs C2 * log N
• CPU vs Memory

Cache-Friendly Algorithms

Depends on N, C1 and C2

Where's the Bottleneck?
Cache misses...

...the wildcard of perf
What to SIMDize

Repetitive
Independent
Computations

How to SIMDize

Debug Scalar code
Have No Branches
Interleave data
SIMDize code
Gather
mov
mov
mov
mov
shuffle
shuffle
shuffle

Scatter

8 shuffle
9 shuffle
10 shuffle
11 mov
12 mov
13 mov
14 mov








Free Lunch for SIMD

Ports #0, #1, #5
Port #2
Port #3
Port #4

Full
Swizzle-Pack Data



1 mov 

2 mov 
Gather-Scatter


Pack your data 
Gather-Scatter pipeline well 

1 op per byte 
Interleave

u16 | u16
u16 | u16

float float float float

x x x x
y y y y
z z z z
template <typename Float>
inline SinCos<Float> ApproximateGivensRotation( const Float &a11, const Float &a12, const Float &a22 )
{
    const Float two = Replicate<Float>( 2.0f );
    Float ch = two * ( a11 - a22 );
    Float sh = a12;
    typename FloatTraits<Float>::Bool b = CmpLt( Replicate<Float>( 5.82842712474619f ) * sh * sh, ch * ch );
    Float omega = RsqrtEst( ch * ch + sh * sh );
    SinCos<Float> res;
    res.s = Select( Replicate<Float>( 0.3826834323650897717284599840304f ), omega * sh, b );
    res.c = Select( Replicate<Float>( 0.92387953251128675612818318939679f ), omega * ch, b );
    return res;
}
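The template above is written once and instantiated for scalar or SIMD types. The slide does not show the helpers it relies on, so the sketch below supplies hypothetical scalar versions of `FloatTraits`, `SinCos`, `Replicate`, `CmpLt`, `RsqrtEst` and `Select` for plain `float`. Two assumptions are worth flagging: `Select( a, b, mask )` is taken to return `b` where the mask is true and `a` otherwise, and `RsqrtEst` is computed exactly here, whereas a SIMD version would typically be an estimate.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar stand-ins for the helpers the template expects.
template <typename Float> struct FloatTraits;
template <> struct FloatTraits<float> { typedef bool Bool; };

template <typename Float> struct SinCos { Float s, c; };

template <typename Float> Float Replicate( float fl );
template <> inline float Replicate<float>( float fl ) { return fl; }

inline bool  CmpLt( float a, float b ) { return a < b; }
inline float RsqrtEst( float fl ) { return 1.0f / std::sqrt( fl ); } // exact in this sketch
// Assumed convention: returns flIfTrue where the mask is set.
inline float Select( float flIfFalse, float flIfTrue, bool bMask ) { return bMask ? flIfTrue : flIfFalse; }

template <typename Float>
inline SinCos<Float> ApproximateGivensRotation( const Float &a11, const Float &a12, const Float &a22 )
{
    const Float two = Replicate<Float>( 2.0f );
    Float ch = two * ( a11 - a22 );
    Float sh = a12;
    // True where the normalized (ch, sh) pair is usable as-is.
    typename FloatTraits<Float>::Bool b = CmpLt( Replicate<Float>( 5.82842712474619f ) * sh * sh, ch * ch );
    Float omega = RsqrtEst( ch * ch + sh * sh );
    SinCos<Float> res;
    // Otherwise fall back to the fixed constants, with no branch taken.
    res.s = Select( Replicate<Float>( 0.3826834323650897717284599840304f ), omega * sh, b );
    res.c = Select( Replicate<Float>( 0.92387953251128675612818318939679f ), omega * ch, b );
    return res;
}
```

Calling `ApproximateGivensRotation<float>( 1.0f, 0.0f, 0.0f )` takes the normalized path, while equal diagonal entries with a non-zero off-diagonal fall back to the constants. A four-wide instantiation would only need the same six helpers defined for a SIMD type.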
SIMD data structures

struct Rod
{
    uint16 nNode[ 2 ];
    float flMaxDist;
    float flMinDist;
    float flMassRatio;
    float flRelaxationFactor;
};


struct SimdRod
{
    uint16 nNode[ 2 ][ 4 ];
    float4 f4MaxDist;
    float4 f4MinDist;
    float4 f4MassRatio;
    float4 f4RelaxationFactor;

    void Init( const Rod pScalar[4] );
};
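The slide declares `Init` but does not show its body. A plausible sketch is a plain transpose from four scalar rods into one lane each. Here `float4` is a hypothetical stand-in struct rather than a hardware register type, and `std::uint16_t` replaces the slide's `uint16`.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a 4-wide SIMD register type (assumption for this sketch).
struct float4 { float v[4]; };

struct Rod
{
    std::uint16_t nNode[2];
    float flMaxDist;
    float flMinDist;
    float flMassRatio;
    float flRelaxationFactor;
};

struct SimdRod
{
    std::uint16_t nNode[2][4];
    float4 f4MaxDist;
    float4 f4MinDist;
    float4 f4MassRatio;
    float4 f4RelaxationFactor;

    void Init( const Rod pScalar[4] );
};

// Interleaves four scalar rods: field k of rod i lands in lane i of field k.
void SimdRod::Init( const Rod pScalar[4] )
{
    assert( pScalar != nullptr );
    for ( int nLane = 0; nLane < 4; ++nLane )
    {
        const Rod &rod = pScalar[nLane];
        nNode[0][nLane]             = rod.nNode[0];
        nNode[1][nLane]             = rod.nNode[1];
        f4MaxDist.v[nLane]          = rod.flMaxDist;
        f4MinDist.v[nLane]          = rod.flMinDist;
        f4MassRatio.v[nLane]        = rod.flMassRatio;
        f4RelaxationFactor.v[nLane] = rod.flRelaxationFactor;
    }
}
```

Doing this conversion once at load time means the simulation loop can read each field with a single aligned load instead of gathering it from four structs.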
Cloth SIMD order

SIMD lanes

• Homogeneous
• Independent

Semi-Myth: SIMD is premature optimization
SIMD branches (a la GPU)

true true false false

if ( condition )
    DoThen();
else
    DoElse();

SIMD Select

DoThen()

DoElse()
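The GPU-style approach runs both sides for all lanes and then picks a result per lane. The sketch below models this with plain structs: `float4`, `bool4`, `Select`, and the bodies of `DoThen` and `DoElse` are all hypothetical stand-ins, and the `Select` argument order is an assumption.

```cpp
#include <cassert>

// Stand-ins for a 4-wide register and a 4-wide comparison mask.
struct float4 { float v[4]; };
struct bool4  { bool  v[4]; };

// Per-lane choice: takes ifTrue where the mask is set, ifFalse elsewhere.
inline float4 Select( const float4 &ifFalse, const float4 &ifTrue, const bool4 &mask )
{
    float4 r;
    for ( int i = 0; i < 4; ++i )
        r.v[i] = mask.v[i] ? ifTrue.v[i] : ifFalse.v[i];
    return r;
}

// Placeholder "then" work: doubles every lane.
inline float4 DoThen( const float4 &x )
{
    float4 r;
    for ( int i = 0; i < 4; ++i )
        r.v[i] = x.v[i] * 2.0f;
    return r;
}

// Placeholder "else" work: negates every lane.
inline float4 DoElse( const float4 &x )
{
    float4 r;
    for ( int i = 0; i < 4; ++i )
        r.v[i] = -x.v[i];
    return r;
}

// Both sides run for all four lanes; the mask decides which result is kept.
inline float4 BranchViaSelect( const bool4 &condition, const float4 &x )
{
    return Select( DoElse( x ), DoThen( x ), condition );
}
```

The cost is the sum of both sides rather than the cost of one, so this works best when both sides are cheap and free of side effects.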
SIMD branches (multipass)

1. Sort into batches
2. Process batches
3. Merge results
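The three multipass steps can be sketched in scalar form as follows. `Item`, `ProcessMultipass` and the placeholder "then" and "else" work are hypothetical; in a real SIMD version each batch loop would process several items per instruction.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Item
{
    int   nId;        // original position, used to merge results
    bool  bCondition; // which path this item needs
    float flValue;
};

void ProcessMultipass( std::vector<Item> &items )
{
    // 1. Sort into batches: items taking the "then" path come first.
    std::vector<Item>::iterator mid = std::stable_partition(
        items.begin(), items.end(),
        []( const Item &it ) { return it.bCondition; } );

    // 2. Process batches: each loop runs one path only, with no per-item branch.
    for ( std::vector<Item>::iterator it = items.begin(); it != mid; ++it )
        it->flValue *= 2.0f;          // placeholder "then" work
    for ( std::vector<Item>::iterator it = mid; it != items.end(); ++it )
        it->flValue = -it->flValue;   // placeholder "else" work

    // 3. Merge results: restore the original order.
    std::sort( items.begin(), items.end(),
               []( const Item &a, const Item &b ) { return a.nId < b.nId; } );
}
```

Unlike the select approach, each item pays for only one path, at the price of the sorting and merging passes.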
SIMD

The Good:
• Almost 4x perf
• Separable=Easy
• AVX: 66% users
• Easy to Write

The Bad:
• Hard to Retrofit
• Hard to Automate
• Requires Planning
• Hard to Read



Performance Re-Cap 


• Cache 

• OOO 

• RAM 

• 1 byte per op 

• Resource Pointers 

• SIMD 

• Multi-Thread 
Special Thanks


3D Art Support - Anna Bibikova 
2D Art Support - Heather Campbell 
Links

Agner Fog, Software Optimization Resources
http://agner.org

Software Optimization Guide for AMD Processors
http://support.amd.com/TechDocs/25112.PDF

Intel® 64 and IA-32 Architectures Optimization Reference Manual
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

Intel® Architecture Code Analyzer
https://software.intel.com/en-us/articles/intel-architecture-code-analyzer

The Art of Multiprocessor Programming

Intel VTune
VerySleepy
GlowCode
Luke Stackwalker
Attribution

• Images of Z80: "Z80A-HD" by ZeptoBars - Licensed under CC BY 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Z80A-HD.jpg#mediaviewer/File:Z80A-HD.jpg

• Image of Haswell wafer - licensed under the Creative Commons Attribution 2.0 Generic license, http://en.wikipedia.org/wiki/Haswell_(microarchitecture)#mediaviewer/File:Haswell_Chip.jpg

• Image of Radio-86RK: created by Audriusa, available under the Creative Commons CC0 1.0 Universal Public Domain Dedication, https://commons.wikimedia.org/wiki/File:Radio86RK.png

• Listing of 24-bit float multiplication for Z80: https://drive.google.com/folderview?id=0B4HNIXQZLWM8Z3NQMGJTbHVhRm8&usp=sharing, licensed for free use and distribution

• ZX Spectrum picture - http://commons.wikimedia.org/wiki/File:ZXSpectrum48k.jpg - CC2.5
Performance Recipe

• Better Algorithm!
• 1 byte per op
• Stream
• Prefetch
• SIMDize
• Multithread
• Profit!
Videos, Links and Errata

http://seraiv.space